R version 4.4.1 (2024-06-14)
Platform: aarch64-apple-darwin20
Running under: macOS Sonoma 14.0
Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.12.0
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
time zone: America/New_York
tzcode source: internal
attached base packages:
[1] stats graphics grDevices utils datasets methods base
loaded via a namespace (and not attached):
[1] htmlwidgets_1.6.4 compiler_4.4.1 fastmap_1.2.0 cli_3.6.3
[5] tools_4.4.1 htmltools_0.5.8.1 rstudioapi_0.16.0 yaml_2.3.10
[9] rmarkdown_2.28 knitr_1.48 jsonlite_1.8.9 xfun_0.49
[13] digest_0.6.37 rlang_1.1.4 evaluate_1.0.1
Q1. Git/GitHub
No handwritten homework reports are accepted for this course. We work with Git and GitHub. Efficient and abundant use of Git, e.g., frequent and well-documented commits, is an important criterion for grading your homework.
Apply for the Student Developer Pack at GitHub using your UCLA email. You’ll get a GitHub Pro account for free (unlimited public and private repositories).
Create a private repository biostat-203b-2025-winter and add Hua-Zhou and TA team (Tomoki-Okuno for Lec 1; parsajamshidian and BowenZhang2001 for Lec 82) as your collaborators with write permission.
Top directories of the repository should be hw1, hw2, … Maintain two branches, main and develop. The develop branch will be your main playground, the place where you develop solutions (code) to homework problems and write up the report. The main branch will be your presentation area. Submit your homework files (Quarto file qmd, html file converted by Quarto, all code and extra data sets to reproduce results) in the main branch.
After each homework due date, the course reader and instructor will check out your main branch for grading. Tag each of your homework submissions with tag names hw1, hw2, … Tagging time will be used as your submission time. That means if you tag your hw1 submission after the deadline, penalty points will be deducted for late submission.
After this course, you can make this repository public and use it to demonstrate your skill sets on the job market.
Solution Q1 is completed.
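The branch-and-tag cycle described above can be tried out in a throwaway local repository. This is only a minimal sketch: the temporary directory, the file name hw1.qmd, and the demo identity are hypothetical placeholders, and nothing here touches GitHub.

```shell
# Sketch of the develop -> main -> tag cycle in a scratch repository.
# Directory, file name, and identity are placeholders for illustration.
repo=$(mktemp -d)
cd "$repo" || exit 1
git init -q
git symbolic-ref HEAD refs/heads/main        # name the initial branch main
git config user.name "Student"               # local identity for this demo only
git config user.email "student@example.com"

mkdir hw1
echo "# HW1" > hw1/hw1.qmd
git add hw1
git commit -q -m "Start hw1 report"

git checkout -q -b develop                   # day-to-day work happens here
echo "Answer to Q1" >> hw1/hw1.qmd
git commit -q -am "Answer Q1"

git checkout -q main                         # presentation branch
git merge -q develop                         # bring the finished work over
git tag hw1                                  # the tag marks the submission
tags=$(git tag)
echo "$tags"
```

In the real homework repository, the last step would be followed by pushing the main branch and the tag to GitHub, since a tag that exists only locally is not visible to the graders.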
Q2. Data ethics training
This exercise (and later in this course) uses the MIMIC-IV data v3.1, a freely accessible critical care database developed by the MIT Lab for Computational Physiology. Follow the instructions at https://mimic.mit.edu/docs/gettingstarted/ to (1) complete the CITI Data or Specimens Only Research course and (2) obtain the PhysioNet credential for using the MIMIC-IV data. Display the verification links to your completion report and completion certificate here. You must complete Q2 before working on the remaining questions. (Hint: The CITI training takes a few hours and the PhysioNet credentialing takes a couple days; do not leave it to the last minute.)
Solution Here is the link to my Completion Report. Here is the link to my Completion Certificate.
Q3. Linux Shell Commands
Make the MIMIC-IV v3.1 data available at location ~/mimic. The output of the ls -l ~/mimic command should be similar to the below (from my laptop).
# content of mimic folder
# ls -l ~/mimic/
Refer to the documentation https://physionet.org/content/mimiciv/3.1/ for details of data files. Do not put these data files into Git; they are big. Do not copy them into your directory. Do not decompress the gz data files. These create unnecessary big files and are not big-data-friendly practices. Read from the data folder ~/mimic directly in the following exercises.
Use Bash commands to answer the following questions.
Display the contents in the folders hosp and icu using Bash command ls -l. Why are these data files distributed as .csv.gz files instead of .csv (comma separated values) files? Read the page https://mimic.mit.edu/docs/iv/ to understand what’s in each folder.
Briefly describe what Bash commands zcat, zless, zmore, and zgrep do.
(Looping in Bash) What’s the output of the following bash script?
for datafile in ~/mimic/hosp/{a,l,pa}*.gz
do
  ls -l $datafile
done
Display the number of lines in each data file using a similar loop. (Hint: combine Linux commands zcat < and wc -l.)
Display the first few lines of admissions.csv.gz. How many rows are in this data file, excluding the header line? Each hadm_id identifies a hospitalization. How many hospitalizations are in this data file? How many unique patients (identified by subject_id) are in this data file? Do they match the number of patients listed in the patients.csv.gz file? (Hint: combine Linux commands zcat <, head/tail, awk, sort, uniq, wc, and so on.)
What are the possible values taken by each of the variables admission_type, admission_location, insurance, and ethnicity? Also report the count for each unique value of these variables in decreasing order. (Hint: combine Linux commands zcat, head/tail, awk, uniq -c, wc, sort, and so on; skip the header line.)
The icustays.csv.gz file contains all the ICU stays during the study period. How many ICU stays, identified by stay_id, are in this data file? How many unique patients, identified by subject_id, are in this data file?
To compress, or not to compress. That’s the question. Let’s focus on the big data file labevents.csv.gz. Compare compressed gz file size to the uncompressed file size. Compare the run times of zcat < ~/mimic/labevents.csv.gz | wc -l versus wc -l labevents.csv. Discuss the trade off between storage and speed for big data files. (Hint: gzip -dk < FILENAME.gz > ./FILENAME. Remember to delete the large labevents.csv file after the exercise.)
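The hints in this question all follow one pattern: stream the compressed file through zcat and pipe it into standard text tools, so that no decompressed copy is written to disk. The sketch below shows that pattern on a tiny made-up file rather than on the MIMIC data; the column names and values are invented for illustration only.

```shell
# Toy stand-in for a MIMIC table; the columns and values are made up.
demo=$(mktemp -d)
printf 'subject_id,hadm_id,admission_type\n1,10,URGENT\n1,11,ELECTIVE\n2,12,URGENT\n' |
  gzip > "$demo/toy.csv.gz"

# Line count per file, without writing a decompressed copy to disk.
for datafile in "$demo"/*.gz
do
  echo "$datafile: $(zcat < "$datafile" | wc -l)"
done

# Data rows (header skipped) and number of unique values in column 1.
n_rows=$(zcat < "$demo/toy.csv.gz" | tail -n +2 | wc -l | awk '{print $1}')
n_subjects=$(zcat < "$demo/toy.csv.gz" | tail -n +2 | awk -F, '{print $1}' |
  sort -u | wc -l | awk '{print $1}')

# Frequency table of column 3, most common value first.
type_counts=$(zcat < "$demo/toy.csv.gz" | tail -n +2 | awk -F, '{print $3}' |
  sort | uniq -c | sort -nr | awk '{print $2 "=" $1}')
echo "$n_rows rows, $n_subjects subjects"
echo "$type_counts"
```

Splitting on commas with awk -F, is the simplest approach, but it assumes that no field contains a quoted comma; a file with such fields would need a proper CSV parser.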
Q4. Who’s popular in Pride and Prejudice
You and your friend have just finished reading Pride and Prejudice by Jane Austen. Among the four main characters in the book, Elizabeth, Jane, Lydia, and Darcy, your friend thinks that Darcy was the most mentioned. You, however, are certain it was Elizabeth. Obtain the full text of the novel from http://www.gutenberg.org/cache/epub/42671/pg42671.txt and save to your local folder.
Explain what wget -nc does. Do not put this text file pg42671.txt in Git. Complete the following loop to tabulate the number of times each of the four characters is mentioned using Linux commands.
wget -nc -q http://www.gutenberg.org/cache/epub/42671/pg42671.txt
for char in Elizabeth Jane Lydia Darcy
do
  echo $char:
  grep -o -i "\b$char\b" pg42671.txt | wc -l | awk '{print $1}'
done
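Why the loop uses grep -o rather than grep -c is easiest to see on a small example. The sketch below uses a made-up three-line text instead of the novel: -o prints every match on its own line, so wc -l counts mentions, whereas -c counts matching lines and misses repeated mentions within one line. The \b word boundaries keep "Elizabethan" from being counted.

```shell
# Toy text standing in for the novel; the sentences are made up.
sample=$(mktemp)
printf 'Elizabeth met Darcy.\nDarcy admired Elizabeth, and ELIZABETH knew it.\nElizabethan drama is unrelated.\n' > "$sample"

# -o prints each match on its own line, so wc -l counts matches.
count_o=$(grep -o -i "\bElizabeth\b" "$sample" | wc -l | awk '{print $1}')
# -c counts matching lines, which undercounts two mentions on one line.
count_c=$(grep -c -i "\bElizabeth\b" "$sample")
echo "matches: $count_o, matching lines: $count_c"
```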
What’s the difference between the following two commands?
echo 'hello, world' > test1.txt
and
echo 'hello, world' >> test2.txt
Difference
1. > means overwrite. If the file test1.txt already exists, this command empties it and writes ‘hello, world’ to it. If test1.txt does not exist, this command creates a new file, test1.txt, and writes ‘hello, world’ to it.
2. >> means append. If the file test2.txt already exists, this command adds ‘hello, world’ to the end of the file without deleting the original contents. If test2.txt does not exist, the command creates a new file, test2.txt, and writes ‘hello, world’ to it.
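The difference shows up as soon as each command is run twice. A minimal sketch, using a temporary directory so that no existing files are touched:

```shell
tmpdir=$(mktemp -d)
echo 'hello, world' > "$tmpdir/test1.txt"
echo 'hello, world' > "$tmpdir/test1.txt"    # second write truncates the file first
echo 'hello, world' >> "$tmpdir/test2.txt"
echo 'hello, world' >> "$tmpdir/test2.txt"   # second write appends to the file
lines1=$(wc -l < "$tmpdir/test1.txt" | awk '{print $1}')
lines2=$(wc -l < "$tmpdir/test2.txt" | awk '{print $1}')
echo "test1.txt: $lines1 line(s); test2.txt: $lines2 line(s)"
```

After both pairs of commands, test1.txt holds one line and test2.txt holds two.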
Using your favorite text editor (e.g., vi), type the following and save the file as middle.sh:
#!/bin/sh
# Select lines from the middle of a file.
# Usage: bash middle.sh filename end_line num_lines
head -n "$2" "$1" | tail -n "$3"
Use chmod to make the file executable by the owner, and run
chmod +x middle.sh
./middle.sh pg42671.txt 20 5
Explain the output. Explain the meaning of "$1", "$2", and "$3" in this shell script. Why do we need the first line of the shell script?
Solution 1. What this script does: The script middle.sh extracts a block of lines from the middle of a file. The user supplies three arguments: the file name, the last line to read (end_line), and the number of lines to extract (num_lines).
head -n "$2" "$1" outputs the first $2 lines of the file $1. tail -n "$3" then keeps the last $3 lines of that output. Together they print lines (end_line - num_lines + 1) through end_line of the file.
"$1": the file name. "$2": the number of lines passed to head. "$3": the number of lines passed to tail.
The first line of the script, #!/bin/sh, is the shebang. It tells the operating system which interpreter to use when the script is executed directly, in this case /bin/sh. Without it, the script would be run by whatever shell the caller defaults to (e.g., bash or zsh), which can differ between systems.
chmod +x middle.sh grants permission to execute the script. ./middle.sh pg42671.txt 20 5 runs the script with the file name pg42671.txt, end_line 20, and num_lines 5.
Output: the command prints lines 16-20 of pg42671.txt. head -n 20 pg42671.txt outputs the first 20 lines, and tail -n 5 keeps the last 5 of those 20 lines.
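The line arithmetic is easier to check on a file whose content is its own line numbers. The sketch below recreates the script in a temporary directory and runs it on the numbers 1 to 30 (a made-up input, not the novel); it calls the script through sh instead of relying on the execute bit.

```shell
tmpdir=$(mktemp -d)
seq 1 30 > "$tmpdir/numbers.txt"             # line k contains the number k

# Same script body as middle.sh in the question.
cat > "$tmpdir/middle.sh" <<'EOF'
#!/bin/sh
# Select lines from the middle of a file.
# Usage: bash middle.sh filename end_line num_lines
head -n "$2" "$1" | tail -n "$3"
EOF

# end_line = 20, num_lines = 5; join the output lines with spaces.
result=$(sh "$tmpdir/middle.sh" "$tmpdir/numbers.txt" 20 5 | paste -sd ' ' -)
echo "$result"
```

The result is 16 17 18 19 20, which matches lines end_line - num_lines + 1 through end_line.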
Q5. More fun with Linux
Try the following commands in Bash and interpret the results: cal, cal 2025, cal 9 1752 (anything unusual?), date, hostname, arch, uname -a, uptime, who am i, who, w, id, last | head, echo {con,pre}{sent,fer}{s,ed}, time sleep 5, history | tail.
cal
Displays the calendar of the current month. Output A formatted calendar of the current month with the current date marked.
cal 2025
                              2025
      January               February               March
Su Mo Tu We Th Fr Sa  Su Mo Tu We Th Fr Sa  Su Mo Tu We Th Fr Sa
          1  2  3  4                     1                     1
 5  6  7  8  9 10 11   2  3  4  5  6  7  8   2  3  4  5  6  7  8
12 13 14 15 16 17 18   9 10 11 12 13 14 15   9 10 11 12 13 14 15
19 20 21 22 23 24 25  16 17 18 19 20 21 22  16 17 18 19 20 21 22
26 27 28 29 30 31     23 24 25 26 27 28     23 24 25 26 27 28 29
                                            30 31
       April                  May                   June
Su Mo Tu We Th Fr Sa  Su Mo Tu We Th Fr Sa  Su Mo Tu We Th Fr Sa
       1  2  3  4  5               1  2  3   1  2  3  4  5  6  7
 6  7  8  9 10 11 12   4  5  6  7  8  9 10   8  9 10 11 12 13 14
13 14 15 16 17 18 19  11 12 13 14 15 16 17  15 16 17 18 19 20 21
20 21 22 23 24 25 26  18 19 20 21 22 23 24  22 23 24 25 26 27 28
27 28 29 30           25 26 27 28 29 30 31  29 30
        July                 August              September
Su Mo Tu We Th Fr Sa  Su Mo Tu We Th Fr Sa  Su Mo Tu We Th Fr Sa
       1  2  3  4  5                  1  2      1  2  3  4  5  6
 6  7  8  9 10 11 12   3  4  5  6  7  8  9   7  8  9 10 11 12 13
13 14 15 16 17 18 19  10 11 12 13 14 15 16  14 15 16 17 18 19 20
20 21 22 23 24 25 26  17 18 19 20 21 22 23  21 22 23 24 25 26 27
27 28 29 30 31        24 25 26 27 28 29 30  28 29 30
                      31
      October               November              December
Su Mo Tu We Th Fr Sa  Su Mo Tu We Th Fr Sa  Su Mo Tu We Th Fr Sa
          1  2  3  4                     1      1  2  3  4  5  6
 5  6  7  8  9 10 11   2  3  4  5  6  7  8   7  8  9 10 11 12 13
12 13 14 15 16 17 18   9 10 11 12 13 14 15  14 15 16 17 18 19 20
19 20 21 22 23 24 25  16 17 18 19 20 21 22  21 22 23 24 25 26 27
26 27 28 29 30 31     23 24 25 26 27 28 29  28 29 30 31
                      30
Displays the calendar of the entire year 2025. Output A grid with all the months in 2025.
cal 9 1752
   September 1752
Su Mo Tu We Th Fr Sa
       1  2 14 15 16
17 18 19 20 21 22 23
24 25 26 27 28 29 30
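Displays the calendar for September 1752. The unusual part is that the dates jump from 2 to 14: the days 3 through 13 are missing because Britain and its colonies switched from the Julian to the Gregorian calendar that month and dropped 11 days.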
date
Sat Jan 25 01:47:55 EST 2025
Displays the current date and time. Output: the current system time in the format weekday, month, day, time, time zone, year.
hostname
Justinas-MacBook-Pro.local
Shows the name of the computer or network hostname. Output: The hostname of your machine.
arch
arm64
Displays the architecture of your system (e.g., x86_64, arm64). Output: System architecture.
uname -a
Darwin Justinas-MacBook-Pro.local 23.0.0 Darwin Kernel Version 23.0.0: Fri Sep 15 14:41:34 PDT 2023; root:xnu-10002.1.13~1/RELEASE_ARM64_T8103 arm64
Displays detailed information about the system kernel, architecture, and operating system.
id
Shows the user ID (UID) and group ID (GID) of the current user.
last | head
justinatian console Wed Jan 22 17:36 still logged in
justinatian console Sun Jan 12 11:34 - 06:41 (19:07)
justinatian console Sat Jan 11 21:59 - 02:55 (04:56)
justinatian ttys000 Sat Jan 11 18:31 - 18:31 (00:00)
justinatian ttys000 Sat Jan 11 17:26 - 17:26 (00:00)
justinatian console Sat Jan 11 16:01 - 20:27 (04:25)
justinatian console Sat Jan 11 11:51 - 15:59 (04:07)
justinatian console Fri Jan 10 00:55 - 03:30 (02:35)
justinatian ttys000 Thu Jan 9 22:56 - 22:56 (00:00)
justinatian ttys000 Thu Jan 9 22:56 - 22:56 (00:00)
Displays the login history of users. | head limits the output to the first 10 lines. Output: A list of recent logins
echo {con,pre}{sent,fer}{s,ed}
Generates all possible combinations of words formed by the braces: one choice from each brace group, joined in order.
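Brace expansion can be checked directly. Since it is a bash feature rather than part of plain POSIX sh, the sketch below calls bash explicitly; the three groups of two choices each give 2 x 2 x 2 = 8 words, with the leftmost group varying slowest.

```shell
# Brace expansion is a bash feature, so run the command through bash explicitly.
words=$(bash -c 'echo {con,pre}{sent,fer}{s,ed}')
echo "$words"
n_words=$(echo "$words" | wc -w | awk '{print $1}')
echo "$n_words words"
```

The expansion is purely textual, which is why it also produces non-words such as confered and prefered.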
time sleep 5
real 0m5.008s
user 0m0.000s
sys 0m0.002s
Measures the time taken to execute the command sleep 5, which pauses for 5 seconds. Output: Real, user, and system time for the command execution
history | tail
Displays the last 10 commands in the command history. Output: List of the last 10 commands you executed in the terminal.
Solution Q5 is completed.
Q6. Book
Git clone the repository https://github.com/christophergandrud/Rep-Res-Book for the book Reproducible Research with R and RStudio to your local machine. Do not put this repository within your homework repository biostat-203b-2025-winter.
Open the project by clicking rep-res-3rd-edition.Rproj and compile the book by clicking Build Book in the Build panel of RStudio. (Hint: I was able to build git_book and epub_book directly. For pdf_book, I needed to add a line \usepackage{hyperref} to the file Rep-Res-Book/rep-res-3rd-edition/latex/preabmle.tex.)
The point of this exercise is (1) to obtain the book for free and (2) to see an example of how a complicated project such as a book can be organized in a reproducible way. Use sudo apt install PKGNAME to install required Ubuntu packages and tlmgr install PKGNAME to install missing TeX Live packages.
For grading purposes, include a screenshot of Section 4.1.5 of the book here.
Solution
html screenshot
The Books app on macOS can open the EPUB file, so I used it and took a screenshot to demonstrate my work.